silver label
Performance of weakly-supervised electronic health record-based phenotyping methods in rare-outcome settings
Hong, Yunjing, Nelson, Jennifer C., Williamson, Brian D.
Accurately identifying patients with specific medical conditions is a key challenge when using clinical data from electronic health records. Our objective was to comprehensively assess when weakly-supervised prediction methods, which use silver-standard labels (proxy measures of the true outcome) rather than gold-standard true labels, perform well in rare-outcome settings like vaccine safety studies. We compared three methods (PheNorm, MAP, and sureLDA) that combine structured features and features derived from clinical text using natural language processing, through an extensive simulation study with data-generating mechanisms ranging from simple to complex, varying outcome rates, and varying degrees of silver-label informativeness. We also considered using predicted probabilities to design a chart review validation study. No single method dominated the others across all prediction performance metrics. Probability-guided sampling selected a cohort enriched for patients with more mentions of important concepts in chart notes. SureLDA, the most complex of the three algorithms we considered, often performed well in simulations. Performance depended greatly on selected tuning parameters. Care should be taken when using weakly-supervised prediction methods in rare-outcome settings, particularly if the probabilities will be used in downstream analysis, but these methods can work well when silver labels are strong predictors of true outcomes.
- North America > United States > Oklahoma > Payne County > Cushing (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > United States > Alaska (0.04)
- Health & Medicine > Therapeutic Area > Immunology (1.00)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Health & Medicine > Health Care Technology > Medical Record (1.00)
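Below is a minimal sketch of the probability-guided chart-review sampling the abstract describes: patients are drawn for manual validation with probability proportional to the phenotyping model's predicted probability, which enriches the review set for likely cases in a rare-outcome setting. The function name, the probability floor, and the inverse-probability weights are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def probability_guided_sample(patient_ids, predicted_probs, n_charts, seed=None):
    """Draw patients for manual chart review with probability proportional to the
    phenotyping model's predicted outcome probability, enriching the review set
    for likely cases in a rare-outcome setting."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(predicted_probs, dtype=float)
    p = (probs + 1e-6) / (probs + 1e-6).sum()   # small floor keeps every patient sampleable
    chosen = rng.choice(len(p), size=n_charts, replace=False, p=p)
    weights = 1.0 / (n_charts * p[chosen])      # approximate inverse inclusion weights
    return [patient_ids[i] for i in chosen], weights

# Example: select 50 of 10,000 patients scored by a weakly-supervised model.
scores = np.random.default_rng(0).beta(0.5, 20, size=10_000)   # rare outcome -> mostly low scores
sample, weights = probability_guided_sample(list(range(10_000)), scores, n_charts=50, seed=1)
```

The returned weights would be needed in any downstream validation analysis, since probability-guided sampling deliberately over-represents likely cases.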
Interpretable phenotyping of Heart Failure patients with Dutch discharge letters
Torri, Vittorio, Boonstra, Machteld J., van de Veerdonk, Marielle C., Kalkman, Deborah N., Uijl, Alicia, Ieva, Francesca, Abu-Hanna, Ameen, Asselbergs, Folkert W., Calixto, Iacer
Objective: Heart failure (HF) patients present with diverse phenotypes affecting treatment and prognosis. This study evaluates models for phenotyping HF patients based on left ventricular ejection fraction (LVEF) classes, using structured and unstructured data, assessing performance and interpretability. Materials and Methods: The study analyzes all HF hospitalizations at both Amsterdam UMC hospitals (AMC and VUmc) from 2015 to 2023 (33,105 hospitalizations, 16,334 patients). Data from AMC were used for model training, and from VUmc for external validation. The dataset was unlabelled and included tabular clinical measurements and discharge letters. Silver labels for LVEF classes were generated by combining diagnosis codes, echocardiography results, and textual mentions. Gold labels were manually annotated for 300 patients for testing. Multiple Transformer-based (black-box) and Aug-Linear (white-box) models were trained and compared with baselines on structured and unstructured data. To evaluate interpretability, two clinicians annotated 20 discharge letters by highlighting information they considered relevant for LVEF classification. These were compared to SHAP and LIME explanations from black-box models and the inherent explanations of Aug-Linear models. Results: BERT-based and Aug-Linear models, using discharge letters alone, achieved the highest classification results (AUC=0.84 for BERT, 0.81 for Aug-Linear on external validation), outperforming baselines. Aug-Linear explanations aligned more closely with clinicians' explanations than post-hoc explanations on black-box models. Conclusions: Discharge letters emerged as the most informative source for phenotyping HF patients. Aug-Linear models matched black-box performance while providing clinician-aligned interpretability, supporting their use in transparent clinical decision-making.
- Europe > Netherlands > North Holland > Amsterdam (0.26)
- North America > United States (0.14)
- Europe > Italy > Lombardy > Milan (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.68)
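As a concrete illustration of the silver-labelling step described above, here is a minimal sketch that assigns an LVEF class from a structured echocardiography value when available and otherwise from a textual LVEF mention in the discharge letter. The HFrEF/HFmrEF/HFpEF cut-offs, the regular expression, and the function names are illustrative assumptions, not the study's exact rules.

```python
import re
from typing import Optional

def lvef_class_from_value(lvef: float) -> str:
    # Common convention: reduced < 40%, mildly reduced 40-49%, preserved >= 50%.
    if lvef < 40:
        return "HFrEF"
    if lvef < 50:
        return "HFmrEF"
    return "HFpEF"

LVEF_PATTERN = re.compile(r"(?:LVEF|ejection fraction)\D{0,15}(\d{1,2})\s*%", re.IGNORECASE)

def silver_label(echo_lvef: Optional[float], letter_text: str) -> Optional[str]:
    """Prefer the structured echo value; fall back to a textual LVEF mention."""
    if echo_lvef is not None:
        return lvef_class_from_value(echo_lvef)
    match = LVEF_PATTERN.search(letter_text)
    if match:
        return lvef_class_from_value(float(match.group(1)))
    return None  # no silver label -> hospitalization left unlabelled

print(silver_label(None, "Echocardiografie: LVEF 35%, gedilateerd LV."))  # HFrEF
```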
Reduction of Supervision for Biomedical Knowledge Discovery
Theodoropoulos, Christos, Coman, Andrei Catalin, Henderson, James, Moens, Marie-Francine
Knowledge discovery is hindered by the increasing volume of publications and the scarcity of extensive annotated data. To tackle the challenge of information overload, it is essential to employ automated methods for knowledge extraction and processing. Finding the right balance between the level of supervision and the effectiveness of models poses a significant challenge. While supervised techniques generally result in better performance, they have the major drawback of demanding labeled data. This requirement is labor-intensive and time-consuming and hinders scalability when exploring new domains. In this context, our study addresses the challenge of identifying semantic relationships between biomedical entities (e.g., diseases, proteins) in unstructured text while minimizing dependency on supervision. We introduce a suite of unsupervised algorithms based on dependency trees and attention mechanisms and employ a range of pointwise binary classification methods. Transitioning from weakly supervised to fully unsupervised settings, we assess the methods' ability to learn from data with noisy labels. The evaluation on biomedical benchmark datasets explores the effectiveness of the methods. Our approach tackles a central issue in knowledge discovery: balancing performance with minimal supervision. By gradually decreasing supervision, we assess the robustness of pointwise binary classification techniques in handling noisy labels, revealing their capability to shift from weakly supervised to entirely unsupervised scenarios. Comprehensive benchmarking offers insights into the effectiveness of these techniques and points toward adaptable knowledge discovery systems, marking progress in data-efficient methodologies for extracting useful insights when annotated data is limited.
- Europe (1.00)
- North America > United States > Minnesota (0.28)
- Asia > Middle East > UAE (0.28)
- Overview (0.93)
- Research Report > New Finding (0.46)
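One low-supervision ingredient the abstract alludes to, a dependency-tree signal between two entity mentions, can be sketched as the shortest dependency path between their head tokens; a short path through a connective such as "associated with" is weak evidence of a semantic relation. The snippet uses a general-purpose spaCy model for brevity (a biomedical pipeline would be the natural substitute) and is an assumption-laden sketch, not the authors' algorithm.

```python
# Requires: python -m spacy download en_core_web_sm
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")

def shortest_dependency_path(sentence: str, head1: str, head2: str):
    """Return the tokens on the shortest dependency path between two entity heads."""
    doc = nlp(sentence)
    graph = nx.Graph()
    for token in doc:
        for child in token.children:
            graph.add_edge(token.i, child.i)
    # Naive token lookup by surface form; fine for this single-sentence example.
    idx = {token.text.lower(): token.i for token in doc}
    path = nx.shortest_path(graph, idx[head1.lower()], idx[head2.lower()])
    return [doc[i].text for i in path]

# Tokens on a short path ("associated", "with") hint at a candidate relation.
print(shortest_dependency_path(
    "BRCA1 mutations are associated with hereditary breast cancer.",
    "mutations", "cancer"))
```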
Speech Foundation Models and Crowdsourcing for Efficient, High-Quality Data Collection
Lee, Beomseok, Gaido, Marco, Calapodescu, Ioan, Besacier, Laurent, Negri, Matteo
As in any data-intensive domain, collecting high-quality datasets is a fundamental and costly prerequisite for the development of speech-processing applications. Traditional methods heavily rely on human workforce, whose costs, as data collection scales, are hard to sustain. In the quest for scalable solutions to tackle this problem, crowdsourcing emerged as a viable option that also enables the coverage of diverse populations (Cefkin et al., 2014; Poesio et al., 2017). Due to the variable quality of crowd-sourced data, validation methods that discard low-quality contributions are essential to build reliable datasets (Negri et al., 2011; Sabou et al., 2014; Chittilappilly et al., 2016). This need is exacerbated in the collection of speech-text pairs, where [...] To fill this gap, this paper explores the use of speech foundation models (SFMs) to automatize the validation of crowdsourced speech data. To this aim, we investigate the employment of off-the-shelf SFMs such as Whisper and SeamlessM4T (Radford et al., 2022; Communication et al., 2023), along with machine translation (MT) models and grapheme-to-phoneme conversion (G2P). Through experiments on French, German, and Korean data, we test the integration of SFMs and crowdsourcing to reduce validation costs while preserving final data quality. Our results show that leveraging SFMs yields a cost reduction by over 40%, while maintaining high data quality, significantly improving the efficiency and scalability of crowd-sourced speech data collection.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > Los Angeles County > Los Angeles (0.04)
- Europe > United Kingdom > Scotland (0.04)
- Information Technology > Communications > Social Media > Crowdsourcing (1.00)
- Information Technology > Artificial Intelligence > Speech (1.00)
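A minimal sketch of the kind of SFM-based validation described above: transcribe the crowd worker's recording with an off-the-shelf model and accept it when the transcript is close enough to the prompt text. The packages (openai-whisper, jiwer), the model size, and the WER threshold are assumptions for illustration, not the paper's exact pipeline.

```python
import jiwer
import whisper

model = whisper.load_model("small")   # downloads the checkpoint on first use

def validate_recording(audio_path: str, prompt_text: str, max_wer: float = 0.3) -> bool:
    """Transcribe the contribution and accept it if the transcript is close enough
    to the text the crowd worker was asked to read."""
    hypothesis = model.transcribe(audio_path)["text"]
    wer = jiwer.wer(prompt_text.lower(), hypothesis.lower())
    return wer <= max_wer

# Recordings that fail the check would be routed to (more expensive) human validation.
```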
Multilingual Non-Factoid Question Answering with Silver Answers
Mishra, Ritwik, Vennam, Sreeram, Shah, Rajiv Ratn, Kumaraguru, Ponnurangam
Most existing Question Answering Datasets (QuADs) primarily focus on factoid-based short-context Question Answering (QA) in high-resource languages. However, the scope of such datasets for low-resource languages remains limited, with only a few works centered on factoid-based QuADs and none on non-factoid QuADs. Therefore, this work presents MuNfQuAD, a multilingual QuAD with non-factoid questions. It utilizes interrogative sub-headings from BBC news articles as questions and the corresponding paragraphs as silver answers. The dataset comprises over 370K QA pairs across 38 languages, encompassing several low-resource languages, and stands as the largest multilingual QA dataset to date. Based on manual annotations of 790 QA pairs from MuNfQuAD (the golden set), we observe that 98% of questions can be answered using their corresponding silver answer. Our fine-tuned Answer Paragraph Selection (APS) model outperforms the baselines. The APS model attained an accuracy of 80% and 72%, as well as a macro F1 of 72% and 66%, on the MuNfQuAD test set and the golden set, respectively. Furthermore, the APS model effectively generalizes to certain languages within the golden set, even after being fine-tuned on silver labels.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > India (0.05)
- Europe > United Kingdom > Scotland (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
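The silver-answer construction described above can be sketched as: find interrogative sub-headings in an article and treat the paragraphs that follow each one as its silver answer. The HTML tag names, the question heuristic, and the function name below are assumptions for illustration.

```python
from bs4 import BeautifulSoup

QUESTION_WORDS = ("what", "why", "how", "when", "where", "who", "which")

def extract_silver_qa_pairs(article_html: str):
    """Pair question-like sub-headings with the paragraphs that follow them."""
    soup = BeautifulSoup(article_html, "html.parser")
    pairs = []
    for heading in soup.find_all("h2"):
        text = heading.get_text(strip=True)
        if text.endswith("?") or text.lower().startswith(QUESTION_WORDS):
            answer_parts = []
            for sibling in heading.find_next_siblings():
                if sibling.name == "h2":      # stop at the next sub-heading
                    break
                if sibling.name == "p":
                    answer_parts.append(sibling.get_text(strip=True))
            if answer_parts:
                pairs.append({"question": text, "silver_answer": " ".join(answer_parts)})
    return pairs
```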
Language Model as an Annotator: Unsupervised Context-aware Quality Phrase Generation
Zhang, Zhihao, Zuo, Yuan, Lin, Chenghua, Wu, Junjie
Phrase mining is a fundamental text mining task that aims to identify quality phrases from context. Nevertheless, the scarcity of extensive gold-labeled datasets, which demand substantial annotation effort from experts, renders this task exceptionally challenging. Furthermore, the emerging, infrequent, and domain-specific nature of quality phrases presents further challenges. In this paper, we propose LMPhrase, a novel unsupervised context-aware quality phrase mining framework built upon large pre-trained language models (LMs). Specifically, we first mine quality phrases as silver labels by employing a parameter-free probing technique called Perturbed Masking on the pre-trained language model BERT (coined the Annotator). In contrast to typical statistics-based or distantly-supervised methods, our silver labels, derived from large pre-trained language models, take into account the rich contextual information contained in the LMs. As a result, they bring distinct advantages in preserving the informativeness, concordance, and completeness of quality phrases. Secondly, training a discriminative span prediction model heavily relies on massive annotated data and risks overfitting the silver labels. Alternatively, we formalize the phrase tagging task as a sequence generation problem by directly fine-tuning the sequence-to-sequence pre-trained language model BART on the silver labels (coined the Generator). Finally, we merge the quality phrases from both the Annotator and the Generator as the final predictions, considering their complementary nature and distinct characteristics. Extensive experiments show that LMPhrase consistently outperforms all existing competitors across two phrase mining tasks of different granularity, each tested on two datasets from different domains.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Asia > China > Beijing > Beijing (0.04)
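For readers unfamiliar with Perturbed Masking, the probe behind the Annotator can be sketched as follows: the impact of token j on token i is the change in i's BERT representation when j is masked in addition to i, and comparing impacts between adjacent tokens drives the phrase segmentation. The sketch assumes each word maps to a single WordPiece and uses bert-base-uncased with Euclidean distance; these are illustrative choices, not necessarily the paper's configuration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def impact(words, i, j):
    """Impact of word j on word i: distance between word i's representation when
    only i is masked and when both i and j are masked (Perturbed Masking probe).
    Assumes every word is a single WordPiece, so word k sits at position k+1
    (offset by [CLS]) in the encoder output."""
    def encode(masked_positions):
        toks = [tokenizer.mask_token if k in masked_positions else w
                for k, w in enumerate(words)]
        enc = tokenizer(" ".join(toks), return_tensors="pt")
        return model(**enc).last_hidden_state[0]

    return torch.dist(encode({i})[i + 1], encode({i, j})[i + 1]).item()

words = "deep learning improves quality phrase mining".split()
# Premise of the probe: impact within a phrase ("quality" <- "phrase") should exceed
# impact across a phrase boundary ("improves" <- "quality"), marking where phrases start.
print(impact(words, 3, 4), impact(words, 2, 3))
```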
Effective Proxy for Human Labeling: Ensemble Disagreement Scores in Large Language Models for Industrial NLP
Du, Wei, Advani, Laksh, Gambhir, Yashmeet, Perry, Daniel J, Shiralkar, Prashant, Xing, Zhengzheng, Colak, Aaron
[...] natural language processing (NLP) tasks using the latest generative pretrained models such as GPT (OpenAI, 2023; Ouyang et al., 2022), PaLM (Chowdhery et al., 2022), and many others (Touvron et al., 2023; Bai et al., 2022; Penedo et al., 2023; Taori et al., 2023). This new generation of models opens up many new possibilities, including competitive performance in zero-shot and few-shot settings for tasks that have typically been modeled using a supervised setting (OpenAI, 2023). More established language models (BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), XLM-Roberta (Conneau et al., 2020b), etc.) provide a strong balance of inference cost and task performance for such systems. This broad class of large language models (LLMs) used for complex supervised NLP [...] More recently, Fu et al. (2023) create a meta-model responsible for predicting the accuracy of the LLM using the model's confidence scores as features. Methods from the computer vision (CV) domain to assess unlabeled data more generally have, for example, proposed the average threshold confidence method, which learns a threshold over the model's confidence and predicts accuracy as the fraction of unlabeled examples exceeding that threshold (Garg et al., 2022), or iteratively learn an ensemble of models to identify misclassified data points and perform self-training to improve the ensemble with the identified points (Chen et al., 2021). However, the metrics and hyperparameters in previous works are specifically for classification tasks and cannot be easily extended to more complex tasks.
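A minimal sketch of an ensemble disagreement score in the spirit of the title: the fraction of ensemble members that disagree with the majority vote serves as a gold-label-free proxy for where the deployed model is likely wrong. The aggregation below is a simple illustration, not the paper's exact metric.

```python
from collections import Counter
from typing import List, Sequence

def disagreement_scores(ensemble_predictions: Sequence[Sequence[str]]) -> List[float]:
    """ensemble_predictions[m][n] is model m's label for example n.
    Returns, per example, the fraction of models disagreeing with the majority vote."""
    scores = []
    for example_preds in zip(*ensemble_predictions):
        counts = Counter(example_preds)
        majority = counts.most_common(1)[0][1]
        scores.append(1.0 - majority / len(example_preds))
    return scores

# Three models labeling four utterances; the last example is the one to send for review.
preds = [
    ["positive", "negative", "neutral", "positive"],
    ["positive", "negative", "neutral", "negative"],
    ["positive", "negative", "neutral", "neutral"],
]
print(disagreement_scores(preds))  # roughly [0.0, 0.0, 0.0, 0.67]
```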
Fine-grained Contrastive Learning for Relation Extraction
Hogan, William, Li, Jiacheng, Shang, Jingbo
Recent relation extraction (RE) works have shown encouraging improvements by conducting contrastive learning on silver labels generated by distant supervision before fine-tuning on gold labels. Existing methods typically assume all these silver labels are accurate and treat them equally; however, distant supervision is inevitably noisy -- some silver labels are more reliable than others. In this paper, we propose fine-grained contrastive learning (FineCL) for RE, which leverages fine-grained information about which silver labels are and are not noisy to improve the quality of learned relationship representations for RE. We first assess the quality of silver labels via a simple and automatic approach we call "learning order denoising," where we train a language model to learn these relations and record the order of learned training instances. We show that learning order largely corresponds to label accuracy -- early-learned silver labels have, on average, more accurate labels than later-learned silver labels. Then, during pre-training, we increase the weights of accurate labels within a novel contrastive learning objective. Experiments on several RE benchmarks show that FineCL makes consistent and significant performance gains over state-of-the-art methods.
- North America > United States > California > San Diego County > San Diego (0.04)
- Europe > Sweden > Uppsala County > Uppsala (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
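The "learning order denoising" idea above can be sketched on a toy problem: train in rounds, record the round at which each silver-labeled instance is first predicted correctly, and convert that order into per-instance weights (earlier-learned means more trustworthy) for a later contrastive objective. The scikit-learn classifier and the linear weighting below only illustrate the bookkeeping; the paper applies the idea to a language model.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
true_labels = (X[:, 0] + X[:, 1] > 0).astype(int)
silver = true_labels.copy()
flip = rng.random(500) < 0.2                      # 20% noisy silver labels
silver[flip] = 1 - silver[flip]

n_rounds = 10
first_correct = np.full(len(silver), n_rounds)    # "never learned" stays at n_rounds
clf = SGDClassifier(loss="log_loss", random_state=0)
for rnd in range(n_rounds):
    clf.partial_fit(X, silver, classes=[0, 1])    # one pass = one "epoch"
    newly = (first_correct == n_rounds) & (clf.predict(X) == silver)
    first_correct[newly] = rnd

weights = 1.0 - first_correct / n_rounds          # later-learned -> smaller weight
print("mean weight, clean vs. flipped labels:",
      weights[~flip].mean().round(2), weights[flip].mean().round(2))
```

These weights would then scale each instance's term in the contrastive pre-training loss, down-weighting silver labels the model only fits late (or never), which the paper finds are less accurate on average.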